This project developed a method of predicting bikeshare trips in New York City with emphasis on the effect of bicycle infrastructure, such as bike lanes. We used our modeling predictions to explore bike planning scenarios in New York City. This project is one of the first to use data capturing the unique routes taken by each bikeshare trip, so the methods used herein could be applied to other similar datasets. This document will outline the motivation for our project, exploratory analysis, feature engineering and model building, and the scenario planning tool displayed in our application.
This project was completed for the final capstone in the Master of Urban Spatial Analytics program at the University of Pennsylvania. We are very thankful to our instructors, Ken Steif, Michael Fichman, and Matt Harris, for their insight and patient guidance as we undertook this large task. We are also incredibly grateful and indebted to our clients, Kyle Yakal-Kremski, Apaar Bansal, and Paula Rubira from the New York City Department of Transportation, for their generosity in sharing such a unique dataset with a group of students and working with us through this process.
Riding a bike in a city can be a harrowing task of dodging turning vehicles, weaving around parked cars and delivery trucks, and watching out for a car door that may open any moment. Bike lanes are an important way to carve out space for cyclists to improve the comfort and safety of riding. However, it can be a very challenging task to build broad support for bike lanes and decide which streets should receive bike lanes first. Our project provides a new way of predicting the impact of a bike lane based on unique ridership data from the Citibike system in New York City, in which people rent bikes from docked stations.
A transportation planner’s task of deciding where to install a new bike lane is a challenging one. Planners must create a network throughout the city that allows cyclists to flow between destinations and protects them from the dangers of the road. Planners may prioritize streets for bike lane investments based on the number of collisions between cars and cyclists or the value of a certain road to the network. There are also strategic factors, such as a street repaving, that can present a convenient time for lane reconfiguration. It is also crucial to develop relationships in communities where new bike lanes will be installed to build support for the changes and ensure community members know how to use the new cycling infrastructure.
This project provides a unique way of prioritizing streets for bike lane investments based on observed bikeshare ridership. Predicting the impact of a new bikelane on bike ridership for would give our client, the NYC DOT, an additional tool to use in deciding which streets to prioritize for bike lanes. Furthermore, if planners can explain to a community board the most important streets for adding a bike lane or quantify the impact of adding a more protected facility, it could help advocate for the most needed changes or develop relationships with groups who are more skeptical of bike lanes.
To develop this model, we relied on two main data sources. New York City’s street centerlines dataset shows the type of bike lane on any given street and where the bike lane network has expanded over time. We combined this with trip data from the bikeshare system, Citibike, in which people can rent bikes at docked stations. To predict bikeshare ridership, we relied on additional data sources including American Community Survey 2015-2019 Estimates, Open Street Map amenity data, and additional amenity datasets from New York City Open Data.
This project is unique because it relies on a very uncommon type of trip data that is not usually shared with the public. Most publicly available data on docked bikeshare systems is origin-destination data, meaning you know the stations where a trip starts and ends but not the route the cyclist took in between. Since 2018, Citibike has allowed riders to opt-in to having their trip tracked via their phone’s GPS. The resulting data, rather than being a pair of stations, is a line that shows the exact route a rider took. This allows much greater insight into the choices cyclists make when they ride and allows us to calculate the ridership of bikeshare trips on any given block.
New York City uses three categories for recording bike lanes, which we have termed Protected Lanes, Unprotected Lanes, and Sharrows. The highest level of protection recorded in the data is a protected lane, which is separated from vehicle lanes by parked cars or another kind of buffer to create a dedicated space for cyclists. An unprotected lane is one in which the bike lane is painted onto the street immediately adjacent to the vehicle lane. Even though this is a bike lane, it offers less protection because vehicles can easily wander into the bike lane. The final category, sharrow, is used when streets are not wide enough for separate vehicle and bike lanes. The term sharrow refers to the symbol of an arrow with a cycle and is meant to encourage drivers to share the lane with cyclists. Since the sharrow is very similar to a street without a bike lane, we decided to exclude it from our category of bike lane.
| Protected Lane | Unprotected Lane | Sharrow |
|---|---|---|
Since the the Citibike network does not extend throughout all of New York, we defined a study area based on the time frame used by our predictive model: August 2020. We chose this month because it captures the warm weather of the summer and includes stations in the Bronx, which was only added to the Citibike network starting in July 2020. Our study area therefore comprises a half mile buffer around the Citibike stations that were available in August of 2020. This study area extends for most of Manhattan, and parts of the Bronx, Brooklyn, and Queens.
Converting individual bikeshare trips into useful ridership data is a very challenging task. Since this is such a novel analysis, recording our data cleaning process will be helpful to anyone pursuing a similar project in the future. The steps we pursued are as follows:
Buffer Width
For ease of calculation, we chose a single buffer width of 15 feet. When buffers for different street segments overlap, assigning each point to only one of those buffers is more time-consuming in R. Using a buffer of 15 feet allowed us to capture most points and minimize overlap between buffers. However, many streets are more than 30 feet wide, so this method could miss some ridership. Using a variable buffer width based on street width could improve this calculation in the future.
Lines to Points
The built-in tool in ArcMap to Generate Points Along Lines is meant to place a point along each line at a specified distance. However, we found that with a dataset of 100,000 trips, the resulting data was very spatially inconsistent, with large peaks of 300-400 riders while no other adjoining street had counts higher than 30 trips. This suggested to us that the initial ArcMap tool was not generating points in the way it had described. We performed a second points generation using a custom tool from a Stack Exchange post, which resulted in much more spatially consistent ridership counts. However, there are still some segments where the count on one segment is several hundred trips higher than any adjoining segment. Future analyses should investigate this step and find the optimal way to generate points along lines, either in ArcMap, ArcGIS Pro, or R.